ABSTRACT

Motivation: Predicting protein interactions is one of the more
interesting challenges of the post-genomic era. Many algorithms
address this problem as a binary classification problem: given two
proteins represented as two vectors of features, predict whether they
interact. Importantly, however, computational predictions are only one
component of a larger framework for identifying protein-protein
interactions (PPI). The most promising candidate pairs can be validated
experimentally by testing whether they physically bind to each other.
Since these experiments are more costly and error prone, the
computational predictions serve as a filter aimed at producing a small
number of highly promising candidates.
Here we propose to address this problem as a ranking problem: given
a network with known interactions, rank all unknown pairs based on
the likelihood of their interactions.

In this paper we propose a ranking algorithm that trains multiple
inter-connected models using a passive-aggressive on-line approach.
We show good results predicting protein-protein interactions for
a post synaptic density PPI network. We compare the precision of
the ranking algorithm with local classifiers (Bleakley et al., 2007)
and classic SVM (Vapnik, 1998). Though the ranking algorithm
outperforms the classic SVM classification, its performance is inferior
to that of the local supervised method.

Availability: Interaction inference package is available upon request 
from the authors. 

Contact: asnat.bar-shira@live.biu.ac.il, gal.chechik@biu.ac.il
1 INTRODUCTION 

Revealing the interactions between groups of proteins will
considerably contribute to our understanding of intracellular
processes such as signal transduction, trafficking, or regulation.
The fastest way to explore interactions between proteins is via
high-throughput experiments. In the last decade many high-throughput
methods were developed, such as yeast two-hybrid (Fields and Song, 1989),
which detects protein interactions, and mass spectrometry (Ho et al., 2002),
which identifies components of protein complexes.
These methods systematically probe interactions in a large group
of proteins. But the data obtained by such studies is partial, and
susceptible to under- and over-detection (Qi et al., 2006). Moreover,
different methods yield different interactomes. In yeast, for example,
Von Mering et al. found that only 2,400 out of 80,000
interactions detected by large-scale approaches were supported by
more than one method. A biological solution to this ambiguity
is to use small-scale methods to check the proteins one by one,
validate the detected connections, and find the undetected ones.
This is of course a tedious, time-consuming job, which would take
years if done extensively. A more feasible way is to introduce an
intermediate step, in which interactions are inferred in silico. In
this setting, computational methods put forward a short list of the most
probable interactions, which can then be experimentally verified.



The wealth of biological data besides the interactomes themselves
encourages the use of supervised learning methods. Data sets such as
gene expression (Eisen et al., 1998), protein localization (Huh
et al., 2003), signatures (Apweiler, 2001) or phylogenetic profiles
(Pellegrini, 1999) can readily serve as bags of features for the
inspected proteins.

The most common approach to edge prediction via supervised
learning is to train a binary classifier from available datasets and
use it to infer unknown interactions within this set of proteins
(Yamanishi et al., 2004; Ben-Hur and Noble, 2005). However, the
features that are predictive may differ across different families of
proteins, or even change dramatically from one protein pair to
another. For example, gene-expression features measured under
amino-acid starvation conditions may be very predictive for
biosynthesis proteins, but not for mitochondrial proteins. As a result,
learning a single unified model across the full network may
over-generalize.

A possible solution is to train a separate classifier for each
protein (Bleakley et al., 2007), but this considerably narrows the
amount of data that can be used to train each classifier. A possible
alternative is to train a set of dependent classifiers that share data
between associated proteins. For this purpose we developed COLoR,
a coordinated local ranker. COLoR treats edge prediction as a
multi-task ranking problem. It defines a separate learning problem
for each protein in the network. For regularization, it constrains
models of neighboring proteins to be similar, yielding a set of
models that varies smoothly across the network. Since the number
of models is large (equal to the number of nodes), we take an on-line
large-margin approach that is based on the Passive-Aggressive
family of models and scales to handle thousands of models. In
this paper we compare ranking with global and local classification
algorithms. We found COLoR to be much more precise than the
global SVM. Unfortunately, for mid-size PPI networks, which are
our networks of interest, the local classifiers outperform COLoR.



2 METHODS 

2.1 The learning problem 

Let $P = \{p_1, \ldots, p_n\}$ be a set of proteins, and let $G = (P, E)$ be a graph
that defines their pairwise interactions. Each edge in the graph, $e(p_i, p_j) \in \{0, 1\}$,
is a binary variable with value 1 iff $p_i$ and $p_j$ interact.

Our goal is to learn a scoring function $S_W(p_i, p_j)$ with parameter $W$ that
assigns a higher score to pairs that interact:

$$S_W(p_i, p^+) > S_W(p_i, p^-) \qquad \forall (p_i, p^+, p^-) :\ e(p_i, p^+) \in E,\ e(p_i, p^-) \notin E,\ p_i \in P \tag{1}$$

In what follows, we focus on bilinear scoring functions of the form

$$S_W(p_i, p_j) = p_i^T W p_j$$

Typically, a single scoring function is trained that is common to all pairs.
However, the way features predict if two proteins interact may vary quite 
substantially in different parts of the network. 

To capture such variability we learn multiple models (scoring functions): 
a specific model for each node in the network, and a global scoring function 
that is common to all vertices. The combined scoring function is

$$S_W(p_i, q) = \beta\, p_i^T W_i\, q + (1 - \beta)\, p_i^T W_0\, q \tag{2}$$



where $W_0$ is the global learner that is common to all interactions, $W_i$ is
a local learner specific to the $i$-th node, and $\beta$ is a trade-off parameter that
weights the contributions of the local and global models. When $\beta = 0$, the
model reduces to the single-task learning problem. Following (Grangier and
Bengio, 2008) and (Chechik et al., 2010), we minimize the following hinge
loss for every triplet $(p_i, p^+, p^-)$:

$$\ell_{W_i, W_0}(p_i, p^+, p^-) = \max\left(0,\ 1 - S_W(p_i, p^+) + S_W(p_i, p^-)\right) \tag{3}$$
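
As a minimal sketch of Equations (2) and (3), the following Python snippet evaluates the combined bilinear score and the triplet hinge loss on toy vectors. The function names, the nested-list matrix representation, and all numeric values are illustrative assumptions and are not taken from the authors' package.

```python
def bilinear(p, W, q):
    """Return p^T W q for feature vectors p, q and a matrix W (nested lists)."""
    return sum(p[r] * W[r][c] * q[c] for r in range(len(p)) for c in range(len(q)))


def combined_score(p, q, W_local, W_global, beta):
    """Equation (2): beta * p^T W_i q + (1 - beta) * p^T W_0 q."""
    return beta * bilinear(p, W_local, q) + (1.0 - beta) * bilinear(p, W_global, q)


def triplet_hinge_loss(p, p_pos, p_neg, W_local, W_global, beta):
    """Equation (3): max(0, 1 - S(p, p+) + S(p, p-))."""
    s_pos = combined_score(p, p_pos, W_local, W_global, beta)
    s_neg = combined_score(p, p_neg, W_local, W_global, beta)
    return max(0.0, 1.0 - s_pos + s_neg)


# Toy 2-dimensional features with identity models (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
p, p_pos, p_neg = [1.0, 0.0], [0.5, 0.0], [0.2, 1.0]

# The interacting partner scores 0.5 and the non-partner 0.2, so the margin
# of 0.3 is below 1 and the triplet incurs a positive loss.
loss = triplet_hinge_loss(p, p_pos, p_neg, identity, identity, beta=0.5)

# A partner that already beats the non-partner by more than 1 incurs no loss.
no_loss = triplet_hinge_loss(p, [3.0, 0.0], [0.0, 0.0], identity, identity, beta=0.5)
```

Setting `beta=0.0` in this sketch ignores `W_local` entirely, which corresponds to the single-task case described above.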



Naturally, the local models must be regularized so that learning can generalize
across pairs. We achieve this by defining a set of weights $\alpha_{ij}$ that force the two models
$W_i$ and $W_j$ to be close in $L_2$ terms. This defines penalty terms of the form
$\alpha_{ij}\|W_i - W_j\|$. When prior knowledge is available about which nodes
should be similar, it can be used to set the $\alpha$'s. Alternatively, we set the $\alpha$'s
based on the known network structure, $\alpha_{ij} = e(p_i, p_j)$. This way, connected
nodes are pushed to have similar models.

The multi-task problem of optimizing all $W_i$ jointly is large,
since typical PPI networks include hundreds of nodes and edges. We
therefore take an on-line approach for optimizing all $W_i$. In this setting,
it is natural to add another regularization term that forces $W_i$ at each step
to remain close to its previous value, $\|W_i^t - W_i^{t-1}\|$. Together, we
therefore define an "influence set" for each vertex model that contains
the neighbors of $p_i$ and an additional pseudo-neighbor $W_i^{t-1}$, which holds
$i$'s history. We denote by $N(i)$ the extended set of "neighbors", $N(i) =
\{j \mid e_{ij} = 1\} \cup i_{history}$, where $i_{history}$ refers to the same node in the previous
round. We set the weight of the pseudo-neighbor to $\alpha = 1$.
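
The influence sets described above can be sketched in a few lines of Python. This is only an illustration: the integer node indices, the edge-list input, and the `"history"` marker that stands in for the pseudo-neighbor are assumptions made for the example.

```python
def influence_sets(n_nodes, edges):
    """Map each node i to its influence set N(i).

    Each set holds the graph neighbors of i (alpha_ij = 1 for known edges)
    plus the marker "history", which stands for node i's own model from the
    previous round (the pseudo-neighbor, also weighted 1).
    """
    neighbors = {i: set() for i in range(n_nodes)}
    for i, j in edges:
        # Interactions are undirected, so record the edge in both directions.
        neighbors[i].add(j)
        neighbors[j].add(i)
    return {i: sorted(nbrs) + ["history"] for i, nbrs in neighbors.items()}


# Hypothetical 3-protein chain: 0 - 1 - 2.
sets = influence_sets(3, [(0, 1), (1, 2)])
```

In this toy chain, node 1 is regularized toward the models of nodes 0 and 2 and toward its own previous model, while nodes 0 and 2 are each tied to node 1 and to their own history.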



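2.2 A ranking algorithm for local models

We describe an on-line multi-task learning algorithm based on the family of
passive-aggressive algorithms introduced by (Crammer et al., 2006). First,
all $W_i$'s are initialized at $W_i^0 = I$. Then, at each iteration $t$, a protein $p_i$
is sampled, together with a protein $p^+$ such that $e(p_i, p^+) \in E$ and a protein $p^-$
that is not connected to $p_i$. This provides a triplet $(p_i, p^+, p^-)$ for which we
define the following optimization problem:

$$\min_{W_i, W_0, \xi}\ \frac{\beta}{2} \sum_{j \in N(i)} \alpha_{ij} \|W_i - W_j\|_{Fro}^2 + \frac{1 - \beta}{2} \|W_0 - W_0^{t-1}\|_{Fro}^2 + C\xi \tag{4}$$

$$\text{s.t.} \quad \ell_{W_i, W_0}(p_i, p^+, p^-) \le \xi, \quad \xi \ge 0$$

We follow (Crammer et al., 2006) to develop an algorithm for solving (4).
If $\ell_{W_i, W_0} = 0$, no update is needed. Otherwise, we define the Lagrangian

$$\mathcal{L}(W_i, W_0, \tau, \xi, \lambda) = \frac{\beta}{2} \sum_{j \in N(i)} \alpha_{ij} \|W_i - W_j\|_{Fro}^2 + \frac{1 - \beta}{2} \|W_0 - W_0^{t-1}\|_{Fro}^2 + C\xi + \tau \left[ 1 - \xi - p_i^T \left( \beta W_i + (1 - \beta) W_0 \right) \delta p \right] - \lambda \xi \tag{5}$$

where $\delta p = (p^+ - p^-)$, and $\tau \ge 0$ and $\lambda \ge 0$ are Lagrange multipliers. To
find the optimal solution, we equate $\partial \mathcal{L} / \partial W_i$ and $\partial \mathcal{L} / \partial W_0$ to zero, and obtain

$$W_i = \bar{W}_i + N_e \tau V, \qquad W_0 = W_0^{t-1} + \tau V \tag{6}$$

where $\bar{W}_i$ is the average of the neighbors' $W$'s, $N_e$ is the number of
neighbors, and $V = p_i\, \delta p^T = p_i (p^+ - p^-)^T$.

Differentiating the Lagrangian with respect to $\xi$ and setting the result to zero yields

$$C - \tau - \lambda = 0 \tag{7}$$

Plugging the above back into (5), where the terms in $\xi$ cancel by (7), we obtain

$$\mathcal{L}(\tau) = \frac{\beta}{2} \sum_{j \in N(i)} \alpha_{ij} \left\| (\bar{W}_i + N_e \tau V) - W_j \right\|_{Fro}^2 + \frac{1 - \beta}{2} \left\| (W_0^{t-1} + \tau V) - W_0^{t-1} \right\|_{Fro}^2 + \tau \left[ 1 - \beta\, p_i^T \bar{W}_i\, \delta p - (1 - \beta)\, p_i^T W_0^{t-1}\, \delta p - (1 - \beta + N_e \beta)\, \tau\, \|V\|^2 \right] \tag{8}$$

Taking the derivative with respect to $\tau$ gives us the optimal $\tau$:

$$\tau = \min\left\{ C,\ \frac{\ell_{W_i, W_0}(p_i, p^+, p^-)}{(1 - \beta + N_e \beta)\, \|V\|^2} \right\} \tag{9}$$

We name this algorithm COLoR, Coordinated On-line Local Rankers.
Algorithm 1 presents the pseudo code.
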

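Algorithm 1 COLoR - Online algorithm for learning coordinated models

Initialization: Initialize $W_0^0 = I$, $W_i^0 = I$.

Iterations:

repeat:

    Sample three proteins $p_i$, $p^+$, $p^-$ such that
    $\Pr(e(p_i, p^+) = 1) > \Pr(e(p_i, p^-) = 1)$

    Update $W_i^t = \bar{W}_i + N_e \tau V$, $W_0^t = W_0^{t-1} + \tau V$

    where $\tau = \min\left\{ C,\ \frac{\ell_{W_i, W_0}(p_i, p^+, p^-)}{(1 - \beta + N_e \beta)\, \|V\|^2} \right\}$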
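
A single iteration of Algorithm 1 might look roughly like the Python sketch below. It follows the update rule as printed in the text. Several choices are assumptions made for illustration: the function names, the nested-list matrices, uniform weights $\alpha_{ij} = 1$, counting the history pseudo-neighbor in $N_e$, and evaluating the loss at the current models before the update.

```python
def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[a * b for b in v] for a in u]


def bilinear(p, W, q):
    """Return p^T W q."""
    return sum(p[r] * W[r][c] * q[c] for r in range(len(p)) for c in range(len(q)))


def add_scaled(A, B, s):
    """Return A + s * B for equally sized matrices."""
    return [[a + s * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def average(mats):
    """Element-wise mean of a non-empty list of equally sized matrices."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[r][c] for m in mats) / n for c in range(cols)] for r in range(rows)]


def hinge(p, p_pos, p_neg, W_i, W_0, beta):
    """Triplet hinge loss of the combined local/global bilinear score."""
    dp = [a - b for a, b in zip(p_pos, p_neg)]
    margin = beta * bilinear(p, W_i, dp) + (1.0 - beta) * bilinear(p, W_0, dp)
    return max(0.0, 1.0 - margin)


def color_step(W_i_prev, W_neighbors, W_0_prev, p, p_pos, p_neg, beta, C):
    """One update for node i; returns (W_i, W_0, tau)."""
    loss = hinge(p, p_pos, p_neg, W_i_prev, W_0_prev, beta)
    dp = [a - b for a, b in zip(p_pos, p_neg)]
    V = outer(p, dp)
    v_norm2 = sum(x * x for row in V for x in row)
    if loss == 0.0 or v_norm2 == 0.0:
        return W_i_prev, W_0_prev, 0.0      # passive: no update needed
    influence = W_neighbors + [W_i_prev]    # neighbors plus history (assumption)
    n_e = len(influence)
    W_bar = average(influence)
    tau = min(C, loss / ((1.0 - beta + n_e * beta) * v_norm2))
    return add_scaled(W_bar, V, n_e * tau), add_scaled(W_0_prev, V, tau), tau


# First step from the identity initialization (toy, hypothetical features).
I2 = [[1.0, 0.0], [0.0, 1.0]]
p, p_pos, p_neg = [1.0, 0.0], [0.5, 0.0], [0.2, 1.0]
W_i, W_0, tau = color_step(I2, [I2, I2], I2, p, p_pos, p_neg, beta=0.5, C=10.0)
loss_after = hinge(p, p_pos, p_neg, W_i, W_0, 0.5)
_, _, tau_capped = color_step(I2, [I2, I2], I2, p, p_pos, p_neg, beta=0.5, C=0.1)
```

In this toy case all models in the influence set are equal, so the uncapped step drives the triplet loss to zero, while a small `C` caps the step size at `C`.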



3 EXPERIMENTS 

We tested the algorithm on a network of PPIs describing interactions
in the post synaptic density (PSD). The network was constructed from trusted
scientific reports that describe interactions between proteins in the
PSD. Proteins that inherently lacked data for one or
more feature sets (for example, proteins whose genes are not included
on the mouse expression GeneChip) were removed. The resulting network
included 211 interactions between 114 proteins. To integrate the various
sources, and to be compatible with other data sources (NCBI GEO
gene expression, for example), network vertices were represented by
their gene symbols as specified in the Mouse Nomenclature guidelines
(http://www.informatics.jax.org/mgihome/nomen/gene.shtml).

3.1 Protein features

3.1.1 Gene Expression We downloaded 459 microarray expression
profiles from NCBI GEO (http://www.ncbi.nlm.nih.gov/geo/), all belonging
to the NCBI GPL81 platform (Mus musculus Affymetrix Murine Genome
U74 Version), which measures expression profiles of 12,488 genes. We
retrieved only datasets that hold the results of brain-related experiments. The
expression data in each experiment's results column was normalized using the
Box-Cox transformation and then scaled to zero mean and unit variance. Data
was downloaded from NCBI GEO on March 28, 2010.
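
The per-column normalization described above can be illustrated with a short Python sketch. The text does not state how the Box-Cox parameter was chosen, so the fixed `lam` argument here is an assumption (in practice it is often estimated by maximum likelihood), and the function names and expression values are hypothetical.

```python
import math
from statistics import mean, pstdev


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def standardize(values):
    """Scale values to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# One hypothetical experiment column of raw expression levels.
column = [1.0, 2.0, 4.0, 8.0]
normalized = standardize(box_cox(column, lam=0.0))
```

With `lam=0.0` the transform reduces to a logarithm, which is a common special case for skewed expression intensities.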

3.1.2 Phylogenetic data Pairwise ortholog maps of 99 species were
downloaded from the Inparanoid database (http://inparanoid.cgb.ki.se/). For
each gene we calculated an ortho-score by multiplying the gene's confidence
score by the confidence level of its paralog cluster (the ortholog group
bootstrap value). We created a table of all Mus musculus genes, as given
in MGI (http://www.informatics.jax.org/), and their ortho-scores against all
other 98 species. The orthoXML files were downloaded on December 1st,
2010.

3.1.3 Protein domains and signature annotations We downloaded
data from two databases: InterPro (http://www.ebi.ac.uk/interpro/), an
integrative database of predictive models (signatures), and Pfam
(http://pfam.sanger.ac.uk/), a repository of protein domains. In Pfam, we
used only the high-quality, manually curated Pfam-A domains. Overall we
used 122 Pfam domains and 21178 signatures. XML files were downloaded
from Pfam in June 2010, and from InterPro in January 2011. We used TF-IDF,
a procedure borrowed from information retrieval (Salton and McGill,
1986), to represent the domains and signatures by their weights. In our
context, the domains or signatures serve as "terms", proteins as "documents"
and the entire dataset as the "corpus".
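
To make the term/document analogy concrete, here is a small Python sketch of one common TF-IDF variant (relative term frequency times the log of the inverse document frequency). The exact weighting used in the experiments is not specified in the text, and the protein and domain identifiers below are hypothetical.

```python
import math
from collections import Counter


def tf_idf(protein_domains):
    """Compute TF-IDF weights for each (protein, domain) pair.

    protein_domains maps a protein ("document") to the list of domain or
    signature identifiers ("terms") annotated on it, repeats included.
    """
    n_docs = len(protein_domains)
    doc_freq = Counter()
    for domains in protein_domains.values():
        doc_freq.update(set(domains))  # count each term once per protein

    weights = {}
    for protein, domains in protein_domains.items():
        counts = Counter(domains)
        total = sum(counts.values())
        weights[protein] = {
            d: (c / total) * math.log(n_docs / doc_freq[d])
            for d, c in counts.items()
        }
    return weights


# Hypothetical annotations: domA is ubiquitous, domB and domC are specific.
annotations = {
    "P1": ["domA", "domB"],
    "P2": ["domA"],
    "P3": ["domA", "domC", "domC"],
}
weights = tf_idf(annotations)
```

A domain present on every protein receives weight zero under this variant, so only domains that discriminate between proteins contribute to the feature vectors.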

3.1.4 Co-expression across brain structures We retrieved expression
levels per structure, per gene, from the Allen Brain Atlas (http://www.brain-map.org/),
which reports expression levels across 17 different brain structures.
For each gene we built a vector of 17 entries, each representing the expression level
in one of the brain structures. The search was performed on-line.

3.2 Algorithms comparison

We compared the performance of COLoR with global and local classifiers.
Both classifiers are based on support vector machines (Vapnik, 1998).
The global SVM approach trains a single prediction model for the whole
network. The local SVM approach (Bleakley et al., 2007) trains an independent model
for each vertex. To estimate the accuracy of the three approaches, we
evaluated their predictions on held-out data that was not used during training.
Specifically, we used 5-fold cross-validation where, at each fold, 80 percent of
the proteins are used to train edge predictors and the remaining 20 percent are
used to evaluate the precision of the learned classifier. Given a trained model,
we used it to predict interactions between all candidate pairs of proteins
and ranked the pairs by the likelihood that they interact. We then computed
the precision (the fraction of truly interacting pairs) within the top-k ranked
pairs. Figure 1 depicts the precision at top-k as a function of k for all the
approaches. The local SVM approach is the most precise. COLoR is less
precise, but it is more precise than the global SVM at the top 40 ranked
predictions.
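
The precision-at-top-k measure used here can be sketched as follows. The function name, the pair representation, and the toy scores are assumptions for illustration only.

```python
def precision_at_k(scored_pairs, true_edges, k):
    """Fraction of truly interacting pairs among the k highest-scored pairs.

    scored_pairs: list of ((protein_a, protein_b), score) tuples.
    true_edges:   set of frozensets, one per known interaction, so that the
                  order of the two proteins in a pair does not matter.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    top = ranked[:k]
    if not top:
        return 0.0
    hits = sum(1 for pair, _ in top if frozenset(pair) in true_edges)
    return hits / len(top)


# Hypothetical held-out pairs with model scores.
true_edges = {frozenset(("A", "B")), frozenset(("C", "D"))}
scored = [
    (("A", "B"), 0.9),
    (("A", "C"), 0.8),
    (("D", "C"), 0.7),
    (("B", "D"), 0.1),
]
curve = [precision_at_k(scored, true_edges, k) for k in (1, 2, 3)]
```

Evaluating the function over a range of k values, as in `curve`, gives the kind of precision-versus-k curve that the figures report.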

3.3 Features comparison

In order to evaluate the predictive power of different features, we examined
a collection of microarray results of brain-related experiments from NCBI
GEO (Edgar et al., 2002), domains from Pfam (Bateman, 2004), signatures
from InterPro (Apweiler, 2001), ortholog maps from Inparanoid
(O'Brien, 2005), and gene expression across brain structures from the Allen
Brain Atlas (Lein, 2007). We found that the best precision was achieved
when combining expression, domains, signatures and phylogenetic data
(Figure 2). Data from the Allen Brain Atlas was non-predictive by itself, and
degraded the classification when combined with the other data sets.

Fig. 1. Precision at top k - COLoR, Global and Local SVMs.

Fig. 2. COLoR - Precision at top k for various features.



The same feature composition proved to be the most predictive for
the Global and Local SVMs as well (Figures 3 and 4, respectively).

Fig. 3. Global SVM - Precision at top k for various features.

Fig. 4. Local SVM - Precision at top k for various feature vectors.